feat: async Python client for non-blocking sandbox operations#510
feat: async Python client for non-blocking sandbox operations#510AndyJCai wants to merge 5 commits intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
Welcome @AndyJCai! |
|
Hi @AndyJCai. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
147f81f to
812d058
Compare
Add AsyncSandboxClient and supporting async modules so users can await sandbox creation, command execution, and file I/O without blocking the event loop. This is needed for async frameworks like FastAPI, aiohttp, and async agent orchestrators. Key design decisions: - Explicit retry logic matching sync client's status-code retries - asyncio.Lock guards on shared state for coroutine safety - BaseException handling to catch CancelledError during create - try/finally on all K8s watch streams to prevent leaked connections - Required connection_config (no LocalTunnel support in async) - Lazy __getattr__ import with actionable ImportError message Optional deps: pip install k8s-agent-sandbox[async] Supersedes kubernetes-sigs#256.
812d058 to
0156958
Compare
- Clear _active_connection_sandboxes after closing connections to prevent stale references - Set httpx.AsyncClient timeout to 60s matching sync client behavior (httpx defaults to 5s) Made-with: Cursor
|
/ok-to-test |
|
/assign @SHRUTI6991 |
Made-with: Cursor
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: AndyJCai The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test presubmit-agent-sandbox-e2e-test |
- Remove dead retry branch in async connector (retryable status codes are handled before raise_for_status, so the except handler never sees them) - Narrow except BaseException to except (Exception, CancelledError) to avoid catching KeyboardInterrupt/SystemExit - Replace logging.basicConfig() with logging.getLogger(__name__) so library code doesn't hijack the root logger - Use built-in generics (list, dict, tuple) instead of typing imports since project requires Python 3.10+ - Replace __getattr__ lazy import with global try/except pattern - Add test for close() clearing the sandbox registry Made-with: Cursor
- Make fallback AsyncSandboxClient a class so isinstance() works - Use logger = logging.getLogger(__name__) in async_connector and async_k8s_helper (was still using root logger) - Add None event guard to resolve_sandbox_name watch loop to prevent TypeError crash on None events - Document no-atexit behavior in AsyncSandboxClient docstring Made-with: Cursor
| self.k8s_helper = k8s_helper | ||
|
|
||
| self._base_url: str | None = None | ||
| self.client = httpx.AsyncClient(timeout=httpx.Timeout(60.0)) |
There was a problem hiding this comment.
The PR description mentions relying on transport retries for connection errors. However, httpx.AsyncClient defaults to 0 retries. To enable transport-level retries for connection errors, you should explicitly configure the transport.
| return response | ||
| except httpx.HTTPStatusError as e: | ||
| logger.error(f"Request to sandbox failed: {e}") | ||
| raise SandboxRequestError( |
There was a problem hiding this comment.
Because both httpx.HTTPStatusError and httpx.RequestError (like ConnectError) inherit from httpx.HTTPError, any network error will be caught here and raise SandboxRequestError immediately without retrying. If you don't use AsyncHTTPTransport(retries=N), network errors will fail on the first attempt.
| w = watch.Watch() | ||
| logger.info(f"Resolving sandbox name from claim '{claim_name}'...") | ||
| try: | ||
| async for event in w.stream( |
There was a problem hiding this comment.
The timeout_seconds parameter tells the Kubernetes API server when to close the stream, but the stream can also drop prematurely due to network issues or API server timeouts. If this happens, the loop exits normally and erroneously raises a TimeoutError even if the time elapsed is short.
|
|
||
| async def __aexit__(self, exc_type, exc_val, exc_tb) -> None: | ||
| try: | ||
| await self.delete_all() |
There was a problem hiding this comment.
delete_all closes the connection too.. why are we having an addition close method check?
Summary
Adds
AsyncSandboxClientfor non-blocking sandbox operations usinghttpxandkubernetes_asyncio. Supersedes #256.Credit: Builds on @raceychan's #256 — the core idea, dependency choices, and API shape originate from that PR. This version rebases onto the current modular
k8s_agent_sandboxstructure and incorporates @SHRUTI6991's review feedback (deduplication, tests).Design decisions
connection_configrequiredkubectl port-forward), which is inherently synchronous. Fail fast instead of silently pointing at localhost.except BaseExceptionin create cleanupasyncio.CancelledErroris aBaseException. Without this, cancelled tasks leak orphaned K8s claims.asyncio.Lockon shared stateawaitpoints. Guards check-then-act on_active_connection_sandboxes.try/finallyon watch streamsw.close()runs on exceptions/cancellation, preventing leaked sessions.httpx.AsyncHTTPTransport(retries=N)only retries connection errors, not 500/502/503/504. Manual loop matches sync client behavior.AI disclosure
This PR was prepared with AI tooling (Cursor). All changes were reviewed, understood, and verified by the author.